Galtea raises $3.2M to help enterprises test AI agents, addressing the gap between demo and production performance.
An analysis of the trade-offs in AI language model performance, focusing on how models like Grok 4.20 reduce hallucinations while lagging behind top-tier models on standard benchmarks.
A new study by METR reveals that nearly half of AI-generated code that passes industry benchmarks would be rejected by real developers due to quality and maintainability issues.
AI benchmarking startup Arcada Labs is testing five leading AI models as autonomous agents on X, evaluating their real-world social media capabilities.
OpenAI announces it will no longer evaluate models on SWE-bench Verified due to contamination and data-leakage issues, recommending SWE-bench Pro as a replacement.
A new tutorial from MarkTechPost demonstrates how to use TruLens and OpenAI models to build transparent and measurable evaluation pipelines for LLM applications.
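To make the evaluation-pipeline idea concrete, here is a minimal sketch of the pattern such a tutorial describes: wrap an OpenAI call in a TruLens recorder and attach an LLM-graded feedback function. This assumes the pre-1.0 trulens_eval package (names changed in trulens 1.x) and an OPENAI_API_KEY in the environment; the app_id "qa_app", the model choice, and the example prompt are illustrative, not from the tutorial itself.

```python
# Minimal TruLens evaluation pipeline: wrap an OpenAI call and score it.
# Assumes the pre-1.0 `trulens_eval` package; module paths differ in trulens 1.x.
from openai import OpenAI
from trulens_eval import Feedback, Tru, TruBasicApp
from trulens_eval.feedback.provider import OpenAI as OpenAIProvider

client = OpenAI()            # reads OPENAI_API_KEY from the environment
tru = Tru()                  # local SQLite store for evaluation records
provider = OpenAIProvider()  # OpenAI-backed feedback (LLM-as-judge)

# Feedback function: grade answer relevance on each (input, output) pair.
f_relevance = Feedback(provider.relevance).on_input_output()

def ask(prompt: str) -> str:
    """The app under evaluation: a single chat-completion call."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Wrap the app so every call is recorded along with its feedback scores.
app = TruBasicApp(ask, app_id="qa_app", feedbacks=[f_relevance])

with app as recording:
    app.app("What does an LLM evaluation pipeline measure?")

# Aggregate scores per app, useful for comparing prompt or model variants.
print(tru.get_leaderboard(app_ids=["qa_app"]))
```

The recorder-plus-feedback split is the design point: the application code stays unchanged, while measurable scores accumulate in a store that can be inspected or compared across runs.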